Remove all time inefficient loops from mlbf #22861
base: master
Conversation
Thanks for adding the new tests, that looks good at first glance. I'm going to focus on #22828 so I'll let someone else take care of this PR.
I'd suggest filing a new issue for this; it could be about adding tests (and fixing more things revealed by the tests).
wdyt?
@@ -21,15 +22,47 @@
    def ordered_diff_lists(
Can we just use sets instead? They're much more efficient for comparison, and, to me, sticking to sets everywhere would be more straightforward.
We only need a specific order when we're writing to JSON (because it doesn't support sets), so we could just sort the data right before writing.
I'm afraid ChatGPT and I disagree with you on this one. Here's the breakdown.
TL;DR: we save no time (we actually lose a little) using set difference. On top of that it is less space efficient and, as you said, does not preserve order. I see no reason to prefer it on merit.
In the current implementation we have:
- Building previous_set: This takes O(len(previous)) time.
- List comprehension:
- Iterates over current, which is O(len(current)).
- The x not in previous_set check is O(1) per element due to the constant-time lookup in a set.
Total Time Complexity: O(len(current) + len(previous))
Space Complexity: O(len(previous)) for previous_set.
In the set only option:
- Building set(current): O(len(current))
- Building set(previous): O(len(previous))
- Set difference set(current) - set(previous):
- Iterates over set(current), which is O(len(set(current))).
- For each element, checks if it's in set(previous), which is O(1) per element.
- Converting the set difference to a list: O(len(result))
Total Time Complexity: O(2·len(current) + len(previous))
Space Complexity: O(len(current) + len(previous)) for both sets.
Comparison:
- Time Complexity: Both functions have the same time complexity of O(len(current) + len(previous)).
- Space Complexity: Function 1 uses less additional space because it only builds one set (previous_set), whereas Function 2 builds two sets (set(current) and set(previous)).
- Order Preservation: Function 1 preserves the order of elements in current, while Function 2 does not, because sets are unordered collections.
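In code, the two options being compared look roughly like this (a sketch of the approaches under discussion, not the exact mlbf.py implementation):

```python
def ordered_diff_lists(previous, current):
    # Option 1 (current implementation): one auxiliary set gives
    # O(1) membership checks; preserves the order of `current`.
    previous_set = set(previous)  # O(len(previous)) time and space
    return [x for x in current if x not in previous_set]

def set_diff_lists(previous, current):
    # Option 2: pure set arithmetic; builds two sets and returns
    # the difference in no particular order.
    return list(set(current) - set(previous))
```

Both run in linear time; the difference is the extra set built by option 2 and the order preserved by option 1.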
If we're just considering this function ChatGPT may have a point, but I meant using sets throughout the code in this file; then we can use more efficient set arithmetic and drop this function. (And do sorted(...) only when we're writing JSON.)
FWIW I wasn't considering just this function and it doesn't make a difference in any case. Using set arithmetic is going to result in more space usage and the same amount of time usage. It simply does not benefit us in this use outside of perhaps looking nicer.
Additionally, one of the issues we had with the approach of sorting before writing is that the built-in sort method didn't seem to work predictably, causing really flaky tests. I think we'd have to figure that out if we wanted to switch to sets.
src/olympia/blocklist/mlbf.py
for guid, version, id, block_type in get_all_blocked_items():
    _blocked_version_ids.append(id)
    _blocked[block_type].append((guid, version))
Suggested change:
for guid, version_string, version_id, block_type in get_all_blocked_items():
    _blocked_version_ids.append(version_id)
    _blocked[block_type].append((guid, version_string))
Though I do wonder if it's more efficient to generate _blocked_version_ids and _blocked with list/set generator syntax, rather than loop through with a for loop and append one at a time.
Well this loop will be an O(n) over all blocked items.
And there will be 2 O(1) appends per iteration.. so still O(n)
Do you have an example of something that would be more efficient? I think this doesn't look as pretty but is actually the efficient way to go.
I was thinking more that, in Python anyway, it's generally more efficient to build a list with a generator than to build it by iterating and repeatedly appending items (because it needs resizing and reallocating more). I don't have stats to back up the difference between generating two lists vs. iterating over the list once, though.
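A hedged illustration of that trade-off (the sample rows stand in for get_all_blocked_items(); the real loop also groups pairs into a dict by block_type, which a single comprehension cannot express as cleanly):

```python
# Illustrative rows: (guid, version, version_id, block_type).
items = [("a@ex", "1.0", 1, "hard"), ("b@ex", "2.0", 2, "soft")]

# Style 1: a single pass, appending to both lists per iteration.
ids_loop, pairs_loop = [], []
for guid, version, version_id, block_type in items:
    ids_loop.append(version_id)
    pairs_loop.append((guid, version))

# Style 2: two comprehensions — iterates `items` twice, but each
# list is built in one shot, which the interpreter can do with
# fewer intermediate resizes.
ids_comp = [version_id for _, _, version_id, _ in items]
pairs_comp = [(guid, version) for guid, version, _, _ in items]
```

Both produce identical results; which wins in practice depends on the data size and would need measuring, as the thread concludes.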
Yeah IDK about the perf difference there, but I imagine it would not be significant since we are comparing a slightly slower single iteration versus a slightly faster double iteration. We could give it a try and see if there is a big diff.
... yeah? I remain in the pro-set camp though. And I'd like someone who's been more involved with the intricacies of the current soft-blocking work (e.g. @willdurand) to give this a once-over too.
(I understand this may not be merged for a while anyway, if we're soft-freezing blocklisting changes for QA)
@@ -233,16 +262,17 @@ def generate_and_write_filter(self):
    log.info(json.dumps(stats))

    @lru_cache(maxsize=128)  # noqa: B019
nit: I don't think we're ever going to call the function 128 times 🙃 - it's typically like, 3 times, isn't it?
I think it can be more than three, but yeah, probably not 128. I think I copy-pasted that from the docs; maybe there is a sensible default.
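For reference, maxsize=128 is the default shown in the functools docs, so a smaller bound (or maxsize=None for an unbounded cache) behaves the same way. A toy sketch with a hypothetical stand-in for the cached computation:

```python
from functools import lru_cache

@lru_cache(maxsize=8)  # small bound; 128 is just the documented default
def generate_diffs(block_type, previous_key):
    # Hypothetical stand-in for the expensive diff computation:
    # memoized per (block_type, previous_key) argument pair.
    return (block_type, previous_key)

generate_diffs("hard", "base")
generate_diffs("hard", "base")  # second call is served from the cache
```

The noqa: B019 in the PR is there because lru_cache on a method keeps instances alive for the life of the cache, which is fine here given the short-lived objects.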
# This test is very slow, so we increase the timeout to 180 seconds.
@pytest.mark.timeout(120)
120 or 180?
Just to make it clear: because QA is testing things right now and this is the last push of the year, we'd like to avoid big changes to blocklisting, so this should NOT land in this sprint, even if approved.
Fixes: mozilla/addons#15170
Description
- Move MLBFDatabaseLoader to module scope functions for easier mocking
- Reduce the loops over blocked items in MLBFDatabaseLoader to 1
- Check not_blocked_items with an O(1) set instead of a list
- Refactor generate_diffs to accept a block_type and only diff one set of data at a time
- Cache generate_diffs to ensure computation is executed only one time per instance + block_type + previous_mlbf
- Add a test_mlbf mark
This patch includes logic that should reduce the number of long loops required to process our mlbf data sets and determine the cache and diff of blocks across different scenarios.
Testing
Run the mlbf suite
This will run all of the mlbf-related tests. Expect everything to pass.
Test the timeout for the slow test
Modify pytest.mark.timeout(120), reducing the time to 3 (or some other low number). Expect:
FAILED src/olympia/blocklist/tests/test_cron.py::TestUploadToRemoteSettings::test_large_data_set - Failed: Timeout >10.0s
Test in local environment
If you want to test with a real data set, you will have to create a large amount of blocks and execute the cron.
Code snippet
Execute this code. You will need to execute it many times, or bump `1_000` up to between 1 and 2 million. Grab a coffee; this will take a while.
After that is finished, execute
./manage.py cron upload_mlbf_to_remote_settings
and expect it to take up to 2 minutes to complete.
Checklist
- Link #ISSUENUM at the top of your PR to an existing open issue in the mozilla/addons repository.